| Variable Name | Explanation |
|---|---|
| ID | Number corresponding to the precise combination of the features of the model |
| Model Year | Year of the model of the car |
| Make | Brand of the car |
| Model | The model of the car |
| Estimated Annual Petroleum Consumption (Barrels) | Consumption in Petroleum Barrels |
| Fuel Type 1 | First fuel energy source, only source if not a hybrid car |
| City MPG (Fuel Type 1) | Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 1 |
| Highway MPG (Fuel Type 1) | Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 1 |
| Combined MPG (Fuel Type 1) | Combined city and highway car consumption in miles per gallon, for fuel type 1 |
| Fuel Type 2 | Second energy source if hybrid car |
| City MPG (Fuel Type 2) | Consumption of the car in miles per gallon of fuel when driving in a city, for fuel type 2 |
| Highway MPG (Fuel Type 2) | Consumption of the car in miles per gallon of fuel when driving on a highway, for fuel type 2 |
| Combined MPG (Fuel Type 2) | Combined city and highway car consumption in miles per gallon, for fuel type 2 |
| Engine Cylinders | Number of cylinders of the car |
| Engine Displacement | Measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers |
| Drive | Type of powertrain distribution system that delivers rotational propulsion, such as rear-wheel drive, 4-wheel drive, etc. |
| Engine Description | Description of some features of the car, such as turbo engine, Stop-Start system, ... |
| Transmission | Manual/Automatic transmission, with number of gears and/or model of transmission |
| Vehicle Class | Type of vehicle, such as Minivan, Trucks,... |
| Time to Charge EV (hours at 120v) | Number of hours required to fully charge an EV car at 120v |
| Time to Charge EV (hours at 240v) | Number of hours required to fully charge an EV car at 240v |
| Range (for EV) | Maximum number of miles possible with a fully charged EV car |
| City Range (for EV - Fuel Type 1) | Maximum number of miles possible with a fully charged EV car in a city |
| City Range (for EV - Fuel Type 2) | Maximum number of miles possible while only using electricity with a fully charged hybrid car in a city |
| Hwy Range (for EV - Fuel Type 1) | Maximum number of miles possible with a fully charged EV car on a highway |
| Hwy Range (for EV - Fuel Type 2) | Maximum number of miles possible while only using electricity with a fully charged hybrid car on a highway |
1 Introduction
During our first year as master’s students in Management with a focus on Business Analytics, we had the opportunity to attend lectures on Machine Learning for Business Analytics. This course covered a range of machine learning techniques applicable to business contexts. We explored supervised methods such as regressions, decision trees, support vector machines, and neural networks, as well as unsupervised methods like clustering, Principal Component Analysis (PCA), Factor Analysis of Mixed Data (FAMD), and auto-encoders. Additionally, the course addressed essential topics such as data splitting, ensemble methods, and performance metrics.
In the context of this course, our group undertook an applied project aimed at integrating theoretical knowledge with practical application. Starting from scratch, we identified a suitable dataset to apply machine learning techniques learned in class to real-world scenarios. We selected a comprehensive dataset encompassing vehicle metrics such as MPG, range, engine statistics, and more, spanning over 100 different brands.
The primary aim of our research was to develop a predictive model that can identify the make (i.e., the brand) of a vehicle based on its attributes, such as consumption, range, and fuel type. To accomplish this, we utilized advanced machine learning algorithms, specifically classification trees (CT) and neural networks (NN). We acknowledged that some vehicles might share identical characteristics but differ in other aspects, so our goal was to create a robust model capable of accurately predicting a vehicle’s brand from its features. Additionally, we investigated the potential of clustering our models using unsupervised learning techniques. This project not only deepened our understanding of machine learning concepts but also demonstrated their practical applicability in solving real-world problems.
The dataset was sourced from data.world, a platform that hosts a diverse range of open-access datasets. It comprises over 45,000 rows and 26 columns, with each row representing a unique vehicle and its corresponding features. The dataset includes comprehensive information on the vehicle’s make, model, year, fuel type, consumption per barrel, highway miles per gallon (MPG), among other features. It is available in CSV format, which facilitates ease of access and analysis.
Given that our objective is to predict the make of the vehicle, we chose classification models.
In the subsequent sections, we will first provide a comprehensive description of our dataset in the Data Description section, followed by a detailed Exploratory Data Analysis (EDA) in the EDA section. After an in-depth examination of the data, we will proceed to the data cleaning and preparation phase in the Data Cleaning section. Subsequently, we will outline the various methodologies employed in the Methods section. Utilizing the cleaned dataset, we will implement our initial model, a classification tree, in the Classification Tree section to predict the make of the vehicle. This will be followed by the application of a neural network model in the Neural Networks section. We will then transition to unsupervised learning techniques in the Principal Component Analysis and Clustering sections. Our study will converge in the Recommendations and Discussion section, where we will synthesize our results and explore the implications of our findings. The document will conclude with a References section, followed by an Annex.
2 Data description
For this project, we selected a dataset focused on vehicle characteristics, available as a .csv file from data.world. You can access the dataset via the following link: data.world. It includes a total of 26 features describing 45,896 vehicle models from 141 brands released between 1984 and 2023. Below is a table providing an overview of the available features and their descriptions. You can find a deeper description of the data in Annex 14.2. The analysis of the missing values will be dealt with in the data cleaning section.
2.0.1 Description of the features
3 EDA
Now that we have presented the variables contained in the dataset, let’s try to understand the data structure, characteristics and underlying patterns thanks to an EDA.
3.0.1 Dataset overview
3.0.2 Columns description
Let’s have a quick look at the characteristics of the columns. You will find more statistical details about them in the annex.
Show the code
# Load necessary packages
library(dplyr)
library(knitr)
library(kableExtra)
# Assuming your dataset is named 'data'
# Get the number of rows and columns
num_rows <- nrow(data)
num_cols <- ncol(data)
# Get the frequency of column types
col_types <- sapply(data, class)
type_freq <- table(col_types)
char_count <- ifelse("character" %in% names(type_freq), type_freq["character"], 0)
num_count <- ifelse("numeric" %in% names(type_freq), type_freq["numeric"], 0)
# Create a summary data frame
summary_df <- data.frame(
Name = "data",
Number_of_rows = num_rows,
Number_of_columns = num_cols,
Character = char_count,
Numeric = num_count,
Group_variables = "None"
)
# Display the summary using kable
kable(summary_df, format = "html", table.attr = "style='width:50%;'") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data | 45896 | 26 | 8 | 18 | None |
The dataset we are working with contains approximately 46,000 rows and 26 columns, each row representing a model from one of the 141 brands. From the data overview, we can see that most of our features concern the consumption of the cars. If we now check in more detail in the annex, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only zeros. We will handle these issues in the “Data cleaning” section.
3.0.3 Exploration of the distribution
Now let’s explore the distribution of the numerical features.
Show the code
# melt.data <- melt(data)
#
# ggplot(data = melt.data, aes(x = value)) +
# stat_density() +
# facet_wrap(~variable, scales = "free")
plot_histogram(data) # Time.to.Charge.EV..hours.at.120v. not appearing because all observations = 0

As the majority of models in our dataset are neither electric vehicles (EVs) nor hybrid cars, and because some columns concern only these two types of vehicles, the results show numerous zero values in several columns. This issue will be addressed during the data cleaning process. Additionally, certain features, such as “Engine Cylinders,” are numerically discrete, as illustrated in the corresponding plot.
3.0.4 Outliers Detection
In order to identify occurrences that deviate significantly from the rest of the observations, and to potentially improve the overall quality of the data, we decided to analyse outliers using boxplots. Here are the results for the numerical columns of the dataset:
Show the code
#tentative boxplots
data_long <- data %>%
select_if(is.numeric) %>%
pivot_longer(cols = c("ID",
"model_year",
"estimated_Annual_Petrolum_Consumption_Barrels", "City_MPG_Fuel_Type_1",
"highway_mpg_fuel_type_1",
"combined_MPG_Fuel_Type_1",
"City_MPG_Fuel_Type_2",
"highway_mpg_fuel_type_2",
"combined_MPG_Fuel_Type_2",
"time_to_Charge_EV_hours_at_120v_",
"charge_time_240v",
"range_for_EV",
"range_ev_city_fuel_type_1",
"range_ev_city_fuel_type_2",
"range_ev_highway_fuel_type_1",
"range_ev_highway_fuel_type_2"), names_to = "variable", values_to = "value")
ggplot(data_long, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.size = 0.5) + # Make outlier points smaller
facet_wrap(~ variable, scales = "free_y") + # Each variable gets its own y-axis
theme_minimal() +
theme(legend.position = "none", # Hide the legend
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 0),strip.text = element_text(size = 7)) + # Rotate x-axis labels
labs(title = "Outlier Detection Boxplots", x = "", y = "Values")

Most of our boxplots show extreme values. Again, this is due to the small number of EV and hybrid cars in our dataset compared to the rest of the models, and to the nature of some features that concern only those types of vehicles.
3.0.5 Number of models per make
Now let’s check how many models per make we have in our dataset. In order to have a clear plot, we decided to keep only the top 20 brands on the graph. All the remaining makes are accessible in the table just below.
Show the code
#Number of occurences/model per make
nb_model_per_make <- data %>%
group_by(make, model) %>%
summarise(Number = n(), .groups = 'drop') %>%
group_by(make) %>%
summarise(Models_Per_Make = n(), .groups = 'drop') %>%
arrange(desc(Models_Per_Make))
#makes with only 1 model
only_one_model_cars <- nb_model_per_make %>%
filter(Models_Per_Make == 1 )
#full table
datatable(nb_model_per_make,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))

Show the code
# Option to limit to top 20 makes for better readability
top_n_makes <- nb_model_per_make %>% top_n(20, Models_Per_Make)
#plot
ggplot(top_n_makes, aes(x = reorder(make, Models_Per_Make), y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Models per Make (Top 20)",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability

Of the 141 brands, we notice that only 13 have more than 100 models in the dataset. Among these, only two (Mercedes-Benz and BMW) have more than 400 models present. Let’s now check the number of makes with only one model in the dataset.
Show the code
#table only 1 model
datatable(only_one_model_cars,
rownames = FALSE,
options = list(pageLength = 5,
class = "hover",
searchHighlight = TRUE))

Show the code
ggplot(only_one_model_cars, aes(x = make, y = Models_Per_Make)) +
geom_bar(stat = "identity", color = "black", fill = "grey", show.legend = FALSE) +
labs(title = "Makes with only 1 model in the dataset",
x = "Make",
y = "Number of Models") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(hjust = 1, size = 10),
plot.title = element_text(size = 14)) +
coord_flip() # Flip coordinates for better readability

We can see that Mercedes-Benz and BMW have significantly more models in our dataset, which means that we are dealing with imbalanced categories. We therefore need to be careful when making predictions, as we may encounter bias toward these two majority classes. Several techniques can be used to deal with this problem, such as resampling, ensemble methods (random forests, boosting), and tuning the probability threshold.
4 Data cleaning
In this section we will handle the missing values of our dataset to make sure that we have a clean dataset to perform our EDA and modeling. We will first visualize the missing values in our dataset and then clean the missing values in the columns that we will use for our analysis. We will also remove some rows and columns that are not relevant for our analysis.
Let’s have a look at the entire dataset and its missing values in grey.
We can see that overall, we do not have many missing values in proportion with the size of our dataset. However, we can see that some columns have a lot of missing values. Below we have the detail of the percentage of missing values by columns.
4.1 Dealing with the columns Engine Cylinders and Engine Displacement
As we can see, we have missing values in six columns. Let’s first have a closer look at the engine cylinders and engine displacement columns. They both have 484 missing values. After some data manipulation, we see that these 484 missing values all correspond to electric vehicles. Given that our dataset contains 484 electric vehicles, we know that the missing values in these columns concern only electric vehicles. This makes sense, since electric vehicles do not have a combustion engine and these categories are therefore not really applicable. We will thus replace all missing values in these two columns with “none”.
As we can see, we still have some missing values in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let’s investigate the missing values in the column “Drive”.
4.2 Dealing with the columns Drive, Transmission and Engine Description
We decided to drop the brands with more than 10% of missing values in the “Drive” column. After this operation, we also removed the 8 observations that remained with missing values in the “Transmission” column. We decided to drop the “Engine Description” column, since it contains missing values for more than a third of our observations.
4.3 Final dataset
The dataset is now cleaned and does not contain any missing values. It contains 42240 observations, 18 features and 129 brands. We renamed the columns and stored it in a csv file (data_cleaned.csv). However, for some models, we need to tackle the unbalanced classes in the target variable. For this reason we also created a new csv file in which we drop the makes with fewer than 10 models (data_cleaned_reduced.csv). This dataset contains 42061 observations, 18 features and 66 brands.
Here are the two cleaned datasets we will be working with from now on.
Cleaned Dataset
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data_cleaned | 42240 | 18 | 8 | 5 | None |
Cleaned and Reduced Dataset
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data_cleaned_reduced | 42061 | 18 | 8 | 5 | None |
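The reduction step described above (dropping makes with fewer than 10 models) can be sketched in a few lines of Python. This is an illustrative sketch on a toy DataFrame, not the project’s actual code; the column name `make` matches the dataset, the data values are invented.

```python
# Sketch of the reduction step: keep only makes with at least 10 models.
# Toy data; in the project this would be the cleaned dataset.
import pandas as pd

data_cleaned = pd.DataFrame({
    "make": ["A"] * 12 + ["B"] * 3 + ["C"] * 15,
    "model_year": list(range(12)) + list(range(3)) + list(range(15)),
})

counts = data_cleaned["make"].value_counts()
keep = counts[counts >= 10].index                       # makes with >= 10 models
data_cleaned_reduced = data_cleaned[data_cleaned["make"].isin(keep)]

print(sorted(data_cleaned_reduced["make"].unique()))    # make "B" is dropped
```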
5 Methods
This section provides a detailed explanation of the methods used in this study, aimed at ensuring reproducibility. The global strategy and each specific tool employed are described concisely, with relevant properties highlighted. Clear references are provided for further details where necessary.
5.1 Supervised Learning
5.1.1 Classification Tree
As our target variable is categorical, we decided to use a classification tree to predict the make of the car.
Classification trees consist of one root node and a hierarchical set of binary rules. Each node represents a decision on an attribute, each branch the outcome of that decision, and each leaf a class label (the most important features are located at the top of the graph). The name comes from the shape of the representation, which looks like a tree. Although the same method can also be applied to regression, we use it here for a classification task.
Key parameters and hyperparameters used in classification trees include:
Parameters
Data observations
Method (usually class, which specifies that the decision tree being constructed is for classification tasks)
Hyperparameters
minsplit: minimum number of observations that must exist in a node for a split to be attempted
minbucket: controls the minimum size of terminal nodes
complexity parameter cp: controls the size of the decision tree and helps prevent overfitting
maximum depth: sets a maximum “length” of the tree, starting from the root node
random state: a fixed number (usually 42) that controls the randomness of the estimator
xval: helps in model validation by performing cross-validation
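As an illustration of how these hyperparameters constrain a tree, here is a minimal Python sketch on synthetic data. It assumes scikit-learn, whose settings correspond roughly to the rpart ones listed above (minsplit ~ `min_samples_split`, minbucket ~ `min_samples_leaf`, cp ~ `ccp_alpha`); the mapping is approximate, not exact.

```python
# Sketch: rough scikit-learn equivalents of the rpart hyperparameters above,
# fitted on synthetic data (all values illustrative).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_classes=3,
                           n_informative=5, random_state=42)

tree = DecisionTreeClassifier(
    min_samples_split=20,   # minsplit: observations needed to attempt a split
    min_samples_leaf=7,     # minbucket: minimum size of a terminal node
    ccp_alpha=0.001,        # plays a role similar to rpart's cp
    max_depth=6,            # cap on the depth of the tree
    random_state=42,        # controls the randomness of the estimator
)
tree.fit(X, y)
print(tree.get_depth())     # never exceeds the max_depth cap
```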
5.1.2 Our two primary models
Classification Tree without constraints
Classification Tree with Pruning (Setting a limit for the max depth)
In both cases, we use the training and test accuracies in order to detect overfitting.
5.1.3 Classification Tree without constraints
The classification tree model serves as the baseline. It consists of the following components:
A Root Node: the highest node in the tree. It represents the entire dataset before it is split according to the best features, which separate the observations according to specific criteria.
Internal Nodes: the remaining nodes that are not the root or a leaf node. Each internal node splits the data into two or more subsets based on a feature and a threshold value.
Branches and Edges: connections between the nodes. They represent the outcome of the decision performed at each node.
Leaf node: the final decision/classification. Each leaf corresponds to a class label.
In our case, the Decision Tree Classifier creates the decision tree. It initializes the process of recursively splitting the dataset into subsets based on the values of the features. It also implicitly defines the loss function, with the two most common criteria being Gini impurity (measuring how mixed the classes in a node are, and used to select the best features) and entropy (a measure of disorder, indicating the uncertainty reduced by the split). Even with the stopping rules seen in class (minsplit, cp, minbucket), the tree can sometimes grow too large. We therefore need to prune the tree.
5.1.4 Classification Tree with Pruning
Now let’s talk about the classification tree using pruning.
We have the same components as in the base method. Nevertheless, part of the process differs: in this case we use a complexity parameter to control the size of the tree. This threshold helps prevent overfitting.
One popular method for pruning is the 1-SE rule, which involves the following steps: first build a fully grown tree using standard stopping rules; then calculate the error for all possible sub-trees. The best tree with the lowest error is identified and its standard error (SE) is calculated. Trees with an error within one SE of the error of the best tree are considered equivalent, and among these the shortest tree is selected.
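The 1-SE rule described above can be sketched in Python on synthetic data. This is an illustrative sketch, not the project’s code: it cross-validates trees of increasing depth, finds the depth with the lowest mean error, and then keeps the shallowest tree whose error is within one standard error of that best error.

```python
# Sketch of the 1-SE rule: among candidate depths, keep the shallowest tree
# whose cross-validated error is within one SE of the best tree's error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, n_classes=3,
                           n_informative=6, random_state=42)

depths = range(1, 11)
results = {}
for d in depths:
    scores = cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=42),
                             X, y, cv=5)
    errors = 1 - scores
    results[d] = (errors.mean(), errors.std(ddof=1) / np.sqrt(len(errors)))

best_depth = min(results, key=lambda d: results[d][0])   # lowest mean error
best_err, best_se = results[best_depth]
# shallowest depth whose error is within one SE of the best error
chosen = min(d for d in depths if results[d][0] <= best_err + best_se)
print(chosen, best_depth)
```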
In our case, we have decided to tune the max_depth hyperparameter to limit the maximum depth of the decision tree.
5.1.4.1 Software and Tools
rpart : Used for building classification and regression trees
pandas : Library used for data manipulation and analysis.
Scikit-learn: Used for data preprocessing, splitting the dataset, and calculating class weights.
matplotlib : Used for creating static, animated, and interactive visualizations
seaborn : Provides a high-level interface for drawing attractive and informative statistical graphics
5.1.4.2 Sequence of Analysis
After loading the data and importing the necessary R and Python functions, we prepared the data by first defining the make as the target variable, then encoding the categorical variables using label encoding to convert them into numerical values. This step is essential, as a classification tree needs numerical values to work.
We then split the dataset into training (80%) and testing (20%) sets. This step is essential for preventing overfitting: we select the best model configuration by training the model, then assess the final model’s performance on the “unseen data” of the testing set. This helps to build a robust model that generalizes well to new data.
Once the split was done, we trained our decision tree using the parameters and hyperparameters discussed earlier. Then, using the trained model, we made predictions on the test set and computed the accuracies on both sets. This step is essential for monitoring the performance of the model and controlling for overfitting.
After verification, we found overfitting in the base classification tree model. We then decided to prune our tree by setting different max_depth values in order to see which trade-off between accuracy and overfitting was the best.
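The sequence above can be sketched end-to-end in Python on synthetic data. All data and names here are illustrative, not the project’s actual code: label-encode the target, split 80/20, fit an unconstrained tree, and compare train vs. test accuracy to detect overfitting.

```python
# Sketch of the analysis sequence: encode the target, split 80/20, train a
# tree, and compare training and test accuracy (a large gap = overfitting).
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

X, y_num = make_classification(n_samples=500, n_features=8, n_classes=4,
                               n_informative=6, random_state=42)
# pretend the target is a categorical "make" column (names illustrative)
makes = pd.Series(y_num).map({0: "BMW", 1: "Ford", 2: "Honda", 3: "Kia"})

y = LabelEncoder().fit_transform(makes)          # categorical target -> integers
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
# an unconstrained tree typically fits the training set almost perfectly,
# while the test accuracy lags behind: a sign of overfitting
print(round(train_acc, 3), round(test_acc, 3))
```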
5.1.5 Neural Network
We also used a neural network for this classification task. As already discussed, we want to predict the make of a car using the other features.
A neural network is a computational model inspired by the way biological neural networks in the human brain process information. It consists of interconnected layers of nodes (neurons) where each connection has an associated weight. The network learns to map input data to the desired output by adjusting these weights through a process called training. The model can be tuned using hyperparameters, set before the learning process begins, that govern the overall behavior and performance of the machine learning model.
Key parameters and hyperparameters used in neural networks include:
Learning Rate: Controls how much the model’s weights are updated with respect to the gradient of the loss function.
Number of Epochs: The number of complete passes through the training dataset.
Batch Size: The number of training examples used in one iteration to update the model’s weights.
Number of Layers: Determines the depth of the network, including input, hidden, and output layers.
Number of Neurons per Layer: The size of each layer, influencing the capacity of the model to learn from data.
Activation Functions: Non-linear functions applied to the output of each neuron, such as ReLU, sigmoid, or softmax.
Optimizer: The algorithm used to update the weights, such as Adam, SGD, or RMSprop.
Loss Function: Measures the difference between the predicted and actual output, guiding the optimizer, e.g., categorical cross-entropy for classification tasks.
Dropout Rate: The fraction of neurons to drop during training to prevent overfitting.
Class Weights: Used to handle imbalanced datasets by giving more importance to underrepresented classes.
5.1.5.0.1 Preprocessing the data
An important step before training our model was to separate the numerical and categorical columns in our data preprocessing, because they require different types of handling to prepare them for machine learning algorithms. The numerical columns need to be scaled so that they have a mean of zero and a standard deviation of one, which helps the machine learning algorithm perform better. The categorical columns need to be one-hot encoded, which creates one binary column per category, a format that the machine learning model can understand.
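The preprocessing step above can be sketched in Python on a toy DataFrame. This is an illustrative sketch (column names and values are invented, and the project may use different tooling): numerical columns are standardized to mean 0 and standard deviation 1, and the categorical column is expanded into binary indicator columns.

```python
# Sketch of the preprocessing: scale numerical columns to mean 0 / sd 1 and
# one-hot encode the categorical column (all names and values illustrative).
import pandas as pd

df = pd.DataFrame({
    "combined_mpg": [22.0, 30.0, 18.0, 45.0],
    "engine_displacement": [2.0, 1.6, 3.5, 1.2],
    "fuel_type": ["Regular", "Premium", "Regular", "Electricity"],
})

num_cols = ["combined_mpg", "engine_displacement"]
scaled = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)
encoded = pd.get_dummies(df["fuel_type"], prefix="fuel_type")  # binary columns

X = pd.concat([scaled, encoded], axis=1)
print(X.shape)   # 2 scaled numeric columns + 3 one-hot columns
```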
5.1.5.1 Our three primary models
A Simple Neural Network Model
A Class-Weighted Neural Network Model
And a Dropout-Enhanced Neural Network Model (also using class-weight)
All three models used the cleaned dataset prepared in the data cleaning section. We chose the version with at least 10 observations for each car brand to avoid overly imbalanced classes.
Each model’s architecture, training process, and evaluation metrics are described in the following sections.
5.1.5.1.1 Simple Neural Network Model
The simple neural network model serves as the baseline. It consists of the following components:
Input Layer: Matches the number of features in the dataset.
Hidden Layers: Two dense layers with ReLU activation functions.
Output Layer: A dense layer with a softmax activation function for classification. The softmax activation function outputs, for each input, a probability of belonging to each class (brand).
The model is trained using the Adam optimizer, categorical cross-entropy loss function, and accuracy as the primary metric.
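The softmax output described above can be illustrated with a small numerical sketch: raw scores (logits, one per brand) are turned into a probability distribution that sums to one. The values here are invented for illustration.

```python
# Numerical sketch of the softmax output layer: raw scores (logits) become
# a probability distribution over the candidate brands.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # one score per brand (illustrative)
probs = softmax(logits)

print(probs.round(3))                # probabilities sum to 1
```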
5.1.5.1.2 Class-Weighted Neural Network Model
To address class imbalances, a class-weighted approach has then been applied. This involves assigning higher weights to the minority classes during training. The architecture of the network remains the same as the simple model. The following steps were taken:
Class Weights Calculation: Inverse proportionality to class frequencies.
Model Training: Incorporating class weights into the loss function.
The use of class weights helps in penalizing misclassifications of the minority classes more heavily, thereby improving the model’s performance on imbalanced data.
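The inverse-proportionality idea above can be sketched with the common “balanced” heuristic, weight = n_samples / (n_classes × class_count). This is a hand-rolled sketch on invented labels; the project may compute the weights differently (e.g. via scikit-learn).

```python
# Sketch of class weights inversely proportional to class frequencies,
# using the "balanced" heuristic n_samples / (n_classes * count).
import numpy as np

y = np.array(["BMW"] * 80 + ["Kia"] * 15 + ["Lotus"] * 5)   # imbalanced toy labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)

# the rarest class receives the largest weight
print(dict(zip(classes, weights.round(2))))
```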
5.1.5.1.3 Dropout-Enhanced Neural Network Model
The third model incorporates dropout layers to mitigate overfitting while keeping the modification applied in our second model. Dropout layers randomly set a fraction of neurons to zero at each update during training, which helps prevent the network from becoming overly reliant on specific neurons. The deactivation of some neurons only happens during the training phase. When the model is evaluated on the validation and test sets, all neurons are active. This can result in a higher accuracy on the validation set than on the training set.
Model Architecture: Similar to the simple model with additional dropout layers after each dense hidden layer.
Dropout Rate: Will be tuned during the modeling.
The dropout-enhanced model helps improve generalization by reducing the risk of overfitting to the training data.
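The training-time vs. evaluation-time behaviour described above can be sketched with “inverted” dropout, the variant commonly used by deep learning libraries (this is an illustrative sketch, not the project’s Keras code): during training each activation is zeroed with probability `rate` and the survivors are rescaled by 1/(1−rate), so the expected activation is unchanged; at evaluation time all neurons stay active.

```python
# Sketch of inverted dropout: zero activations with probability `rate` during
# training and rescale the survivors; do nothing at evaluation time.
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate, training):
    if not training:
        return activations                        # all neurons active at test time
    mask = rng.random(activations.shape) >= rate  # keep each unit with prob 1-rate
    return activations * mask / (1.0 - rate)      # rescale to preserve expectation

a = np.ones(10_000)
train_out = dropout(a, rate=0.3, training=True)
eval_out = dropout(a, rate=0.3, training=False)

print(round(train_out.mean(), 2), eval_out.mean())
```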
5.1.5.2 Software and Tools
The following software and tools were used to implement and evaluate the models (non-exhaustive list):
Python: The programming language used for all implementations.
TensorFlow and Keras: Libraries used for building and training neural network models.
Scikit-learn: Used for data preprocessing, splitting the dataset, and calculating class weights.
Pandas and NumPy: Libraries used for data manipulation and numerical computations.
5.1.5.3 Metrics
In our machine learning analysis, we used several key metrics to evaluate our model. The main metrics are the training, validation and test accuracies. The validation accuracy shows how well the model performs on the validation set, helping us tune the model during training. Training accuracy indicates the model’s performance on the training data, which helps identify overfitting. The difference between the training and validation accuracy is useful to detect the presence of overfitting in our models. The test accuracy measures the model’s performance on new, unseen data, providing an unbiased evaluation.
We also used the sensitivity, which measures the model’s ability to correctly identify positive cases, and the specificity, which measures the ability to correctly identify negative cases. Even though we already knew, when we started to create our models, that we had imbalanced classes, these metrics confirmed it with numbers. The last metric used is Cohen’s kappa, which evaluates the agreement between predicted and true labels while accounting for chance, making it more reliable than simple accuracy, especially with imbalanced datasets. Together, these metrics give a comprehensive view of the model’s performance.
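The metrics above can be made concrete with a small worked example from a 2×2 confusion matrix (the counts are invented): sensitivity is the true positive rate, specificity the true negative rate, and Cohen’s kappa is the observed agreement corrected by the agreement expected by chance.

```python
# Worked sketch: sensitivity, specificity, accuracy and Cohen's kappa from
# a 2x2 confusion matrix with illustrative counts.
tp, fn, fp, tn = 40, 10, 20, 30
n = tp + fn + fp + tn

sensitivity = tp / (tp + fn)             # true positive rate
specificity = tn / (tn + fp)             # true negative rate
accuracy = (tp + tn) / n                 # observed agreement
# agreement expected by chance, from the row/column marginals
p_yes = ((tp + fn) / n) * ((tp + fp) / n)
p_no = ((fp + tn) / n) * ((fn + tn) / n)
p_e = p_yes + p_no
kappa = (accuracy - p_e) / (1 - p_e)

print(sensitivity, specificity, round(kappa, 3))
```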
5.1.5.4 Sequence of Analysis
We first encoded the categorical variables and normalized the numerical features. We then trained the models and evaluated each one. By evaluating and investigating the model at each step, we managed to deal with the different challenges that we faced. We also tuned the dropout rate to ensure good model performance while greatly reducing the overfitting we were observing.
This sequence ensures a systematic approach to model development and evaluation, providing a clear understanding of each step involved.
Note that this section was written based on the course website (ML_BA). The parts on the dropout layers are based on the following articles: towardsdatascience and python-course.
5.2 Unsupervised Learning
In order to see the links between features, we decided to proceed with two unsupervised learning methods: Principal Component Analysis and clustering.
5.2.1 Principal Component Analysis
Features: The features are the ones from the data_cleaned data frame.
Scaling: converting the features from categorical to factor, then from factor to numerical. We then applied a spectral decomposition to the correlation matrix.
Variance proportion: we look at the proportion of variance relative to the total variance. We then look at the eigenvalues and check the cumulative percentage of variance until at least 80% of the variance is explained, which gives the number of dimensions that represent the data well.
Circle of correlation: the circle of correlation allows us to check the correlations between the features and to summarize them in two dimensions.
Cos2: the cos2 allows us to interpret the quality of the representation of each feature within the dimensions of the circle of correlation.
Biplot: visualization of the observations (or individuals) within the two dimensions, together with the features.
PCA can be used for EDA: to inspect the data, find and explain clusters, and find dependence between the features. It can also be used for dimension reduction when there are too many features, by keeping only the first few principal components.
PCA can only be performed on numerical features. When categorical features are also included in the analysis: for ordinal data, a quick and dirty solution is to map the modalities to numbers respecting their order; for nominal data, there is no correct solution, and in particular replacing them by numbers is incorrect.
With only categorical data, (Multiple) Correspondence Analysis is a solution, and for mixed data types (categorical and numerical), Factor Analysis of Mixed Data (FAMD) is a solution. However, these methods are not well adapted to large datasets.
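The variance-proportion step described earlier (spectral decomposition of the correlation matrix, then keeping enough dimensions to explain at least 80% of the variance) can be sketched on synthetic data. This is an illustrative Python sketch, not the project’s R code; the data are built from two latent factors so that a few components suffice.

```python
# Sketch of PCA via spectral decomposition of the correlation matrix, then
# counting the components needed to explain at least 80% of the variance.
import numpy as np

rng = np.random.default_rng(42)
base = rng.normal(size=(200, 2))
# 5 features built from 2 latent factors plus noise (synthetic data)
X = np.column_stack([base[:, 0], base[:, 0] + 0.1 * rng.normal(size=200),
                     base[:, 1], base[:, 1] + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]          # sorted descending
explained = eigenvalues / eigenvalues.sum()           # variance proportions
cumulative = np.cumsum(explained)
n_dims = int(np.argmax(cumulative >= 0.80)) + 1       # first index reaching 80%

print(n_dims, cumulative.round(2))
```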
5.2.2 Clustering
In our case, since the features were converted to numerical values (factors mapped to numbers), we use a partitioning method: K-means. The algorithm works as follows:
1. The initial clustering is random: each instance is assigned to one cluster at random.
2. The centers of the clusters are computed.
3. Each instance is re-assigned to the cluster with the closest center.
4. Steps 2 and 3 are repeated until a convergence criterion is satisfied.
To choose the number of clusters, we look at the total within-cluster sum of squares (TWCS) with the elbow method, complemented by silhouette plots.
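The iterative K-means procedure can be sketched in a few lines of Python (a minimal illustration on synthetic data, not our production code):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=123):
    """Minimal K-means following the four steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: assign each instance to one cluster at random
    labels = rng.integers(0, k, size=len(X))
    centers = np.zeros((k, X.shape[1]))
    for _ in range(n_iter):
        # Step 2: compute the centers of the clusters
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
        # Step 3: re-assign each instance to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: stop when the assignment no longer changes
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centers

# Two well-separated synthetic groups as a toy example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])
labels, centers = kmeans(X, k=2)
```

In practice we rely on `kmeans()` from R (or scikit-learn's `KMeans`), which adds smarter initialization and multiple restarts.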
As we have seen several unsupervised learning tools in class, we looked at our dataset and decided to start with a Principal Component Analysis, as we have some categorical variables as well as numerical ones. This technique allows us to combine features into fewer dimensions according to their similarities. We then proceeded with a clustering method and combined both into a single visualisation in order to obtain a clearer result.
To begin the method, it is crucial to standardize the data, meaning to transform the categorical variables into factors, which in turn are transformed into numerical values. Then, to get a first idea of the links between the features before starting the PCA, we will draw a correlation heatmap showing which variables seem to be correlated, whether positively or negatively. We will then run the PCA followed by a screeplot, in order to see the weight of each dimension as well as the observations in the PCA graph.
Depending on the results of the PCA, we might proceed with a clustering to get a clearer overview of the similarities between the observations and divide them into clusters.
To achieve this, the main point is to build on the results of the PCA. We will take the PCA coordinates and run a K-means method, then use the elbow method to find the optimal number of clusters. Depending on the results, we might draw a silhouette plot to check for heterogeneity within the clusters. Finally, since the PCA precedes the clustering, we plan to create a 3D biplot where both the features and the clustered observations can be seen, in order to interpret the final results of the unsupervised learning part.
The final aim is to see the link between the features as well as the similarities of the observations.
6 Classification Tree
In this section we perform a classification tree analysis on the dataset. We first load the necessary packages and the dataset, then prepare the data by encoding the categorical variables and splitting it into training and testing sets. Finally, we prune the tree with different max_depth values to find the tree depth that best balances training and test accuracy.
6.0.1 Loading R packages, Python libraries and Data Preparation
After importing all the necessary libraries and packages, we loaded the dataset and identified make as the target variable. We also encoded the categorical variables using label encoding to convert them into numerical values.
We then split the dataset into training (80%) and testing (20%) sets. This step is essential for preventing overfitting: we select the best model configuration on the training set, then assess the final model's performance on the "unseen data" of the testing set. This helps to build a robust model that generalizes well to new data. Unfortunately, we will see later that our model is overfitting, meaning that it performs poorly on new data; we will address this issue later. The plot below shows the actual data split: the x-axis represents the number of observations in our dataset and the y-axis the number of makes.
6.0.2 Training the Decision Tree without Pruning - Results
Starting to train our Decision Tree, we have first decided to do it without any constraints and see the results of it.
DecisionTreeClassifier(random_state=42)
Training Accuracy (without max_depth): 0.8939
Test Accuracy (without max_depth): 0.6979
Training Cohen's Kappa (without max_depth): 0.8888
Test Cohen's Kappa (without max_depth): 0.6836
As we can see, the resulting tree is very complex, takes a long time to train and is not interpretable at all. Moreover, the training accuracy is higher than both the test accuracy and the test Cohen's kappa, confirming that our model is overfitting.
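For reference, the two metrics reported above can be computed side by side with scikit-learn (the toy labels below are hypothetical, purely to illustrate the calls):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical true and predicted makes, encoded as integers
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1]

acc   = accuracy_score(y_true, y_pred)      # raw proportion of correct labels
kappa = cohen_kappa_score(y_true, y_pred)   # agreement corrected for chance
```

Cohen's kappa is always at most the accuracy, since it discounts the agreement expected by chance; this is why the kappa values in our results sit slightly below the corresponding accuracies.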
6.0.3 Training the Decision Tree with K-fold cross-validation - Results
In this section, we assess the model with 10-fold cross-validation before the train/test split.
# !pip install matplotlib seaborn scikit-learn ------- INSTALL IF NEEDED
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Encode categorical variables
label_encoders = {}
for column in data.columns:
    if data[column].dtype == 'object':
        le = LabelEncoder()
        data[column] = le.fit_transform(data[column])
        label_encoders[column] = le
# Split the data into features and target
target = 'make'
features = data.columns.drop(target)
X = data[features]
y = data[target]
# Initialize the classifier
clf = DecisionTreeClassifier(random_state=42)
# Initialize KFold cross-validator
kf = KFold(n_splits=10, shuffle=True, random_state=42)
# Perform k-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=kf, scoring='accuracy')
# Print the cross-validation scores
print("Cross-validation scores: ", cv_scores)
Cross-validation scores: [0.69341856 0.69839015 0.70288826 0.69625947 0.70643939 0.70123106
 0.68868371 0.70336174 0.68158144 0.69247159]
print("Mean cross-validation score: ", cv_scores.mean())
Mean cross-validation score: 0.6964725378787879
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the classifier on the training set
clf.fit(X_train, y_train)
DecisionTreeClassifier(random_state=42)
# Predict on the training set and test set
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
# Calculate the training and test set accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Training set accuracy: {train_accuracy:.4f}")
Training set accuracy: 0.8939
print(f"Test set accuracy: {test_accuracy:.4f}")
Test set accuracy: 0.6979
# Visualize the decision tree
from sklearn.tree import plot_tree
# plt.figure(figsize=(20, 10))
# plot_tree(clf, filled=True, feature_names=features, class_names=label_encoders[target].classes_)
# plt.title('Decision Tree')
# plt.show()
6.0.4 Training the Decision Tree with Pruning method - Results
In order to obtain a less complex tree and to fight overfitting, we decided to prune the tree before training it. We pruned by trying several max_depth values to control the tree's growth (5, 10, 15, 20, 25, 30), looking for the tree depth that best balances training and test accuracy. For visibility reasons, the classification tree shown uses a max_depth of only 5. "None" corresponds to training without any max_depth parameter. Here are our new results:
| max_depth | Train Accuracy | Test Accuracy | Train Cohen's Kappa | Test Cohen's Kappa |
|---|---|---|---|---|
| 5 | 0.260476 | 0.254972 | 0.217796 | 0.211947 |
| 10 | 0.488666 | 0.467448 | 0.462008 | 0.439267 |
| 15 | 0.720348 | 0.635180 | 0.706778 | 0.617217 |
| 20 | 0.852006 | 0.692590 | 0.845006 | 0.678054 |
| 25 | 0.889796 | 0.697206 | 0.884545 | 0.682865 |
| 30 | 0.893880 | 0.697088 | 0.888805 | 0.682739 |
The accuracy of the model improved as the depth of the tree increased, with a max_depth of 25 or 30 providing the best test accuracy, close to 70%. A max_depth of 5 resulted in significantly lower accuracy, indicating that it is not the optimal choice, while a max_depth of 10 or 15 seems to be the best compromise between overall accuracy and avoiding overfitting.
Pruning the tree with a max_depth of 25 increased the accuracy from 69.84% to 70% and reduced the gap between the test and training sets. Pruning helps improve the generalisation performance of the model by preventing it from becoming too complex and overfitting the training data.
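The depth sweep described above can be sketched as follows (using scikit-learn's built-in wine dataset as a stand-in for our cars data):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# load_wine is only a small stand-in dataset for illustration
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train one tree per candidate depth; None means no depth constraint
results = {}
for depth in (None, 5, 10, 15, 20, 25, 30):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[depth] = (accuracy_score(y_train, clf.predict(X_train)),
                      accuracy_score(y_test, clf.predict(X_test)))
```

Comparing the train/test pairs in `results` shows the same pattern as our table: deeper trees fit the training set better, while the test accuracy plateaus.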
6.0.5 Variable Importance
As we used a classification tree for the prediction, the most important variables in the model are the ones at the top of the tree. Let's visualize the ranking of the feature importances.
# Visualize the feature importances
plt.figure(figsize=(10, 6))
plt.barh(importances_df['Feature'], importances_df['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Feature Importance in Pruned Decision Tree')
plt.gca().invert_yaxis() # To display the most important feature at the top
plt.show()
# Print the sorted feature importances
# print(importances_df)
We can see that the top 3 most important features are, in order: the engine displacement, the model year and the vehicle class. On the other hand, some features such as range_ev_city_fuel_type_2, range_ev_highway_fuel_type_2 and range_ev_city_fuel_type_1 carry no importance for the model's predictions. This is certainly due to the small number of EV and hybrid cars in the dataset, as these features only concern those types of vehicles.
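The ranking itself comes from the fitted tree's `feature_importances_` attribute; a minimal sketch of building the sorted `importances_df` (again on the wine stand-in dataset, not our data):

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset for illustration
data = load_wine()
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(data.data, data.target)

# Importances sum to 1; sort them so the top features come first
importances_df = (pd.DataFrame({"Feature": data.feature_names,
                                "Importance": clf.feature_importances_})
                  .sort_values("Importance", ascending=False)
                  .reset_index(drop=True))
```

Features that are never used in a split, like the EV range variables in our model, end up with an importance of exactly zero.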
7 Neural Networks
In this section, we will build a neural network model to predict the make of a car based on the features at our disposal. We will preprocess the data, split it into training and testing sets, define the neural network architecture, compile the model, train it and evaluate its performance.
7.1 Preprocessing and splitting the data
The dataset contains different types of data. Some columns are numerical (like "city_mpg_fuel_type_1" or "charge_time_240v"), and some are categorical ("vehicle_class" or "fuel_type"). We identify and separate these two types of columns and preprocess them accordingly.
The data is split into two parts: training and testing. The training set is used to train the model, and the testing set is used to evaluate its performance. This split ensures that we can test how well the model generalizes to new, unseen data.
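A minimal sketch of this kind of mixed-type preprocessing (the small `df` and its values are hypothetical stand-ins for our dataset's columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Tiny hypothetical frame mimicking the mixed column types
df = pd.DataFrame({
    "city_mpg_fuel_type_1": [18.0, 24.0, 31.0],
    "charge_time_240v": [0.0, 3.5, 7.0],
    "vehicle_class": ["SUV", "Compact", "Compact"],
    "fuel_type": ["Gasoline", "Gasoline", "Electricity"],
})
num_cols = df.select_dtypes(include="number").columns.tolist()
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),   # scale numerical features
    ("cat", OneHotEncoder(), cat_cols),    # one-hot encode categoricals
])
X = pre.fit_transform(df)
```

The transformed matrix has one column per numerical feature plus one per category level, which is what the network's input layer then receives.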
7.1.1 Building the neural network models and training them
7.1.1.1 Base Neural Network
We chose to use a feed-forward neural network consisting of layers of neurons, where each layer applies a transformation to the data. The first layer, the input layer, takes the input features; the hidden layers help the model learn complex patterns; and the output layer predicts the probability of each car manufacturer. Concretely, the first hidden layer has 128 neurons, the second 64, and the output layer has as many neurons as there are car manufacturers. The activation function used in the hidden layers is the Rectified Linear Unit (ReLU), and the output layer uses the Softmax activation function. The model is compiled with the Adam optimizer and the categorical cross-entropy loss function.
# Define the Base neural network model
model_base = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(y_train.shape[1], activation='softmax')])
We used activation functions in the hidden layers to introduce non-linearity into the model. The ReLU activation function is used in the hidden layers because it is computationally efficient and helps the model learn complex patterns in the data. The Softmax activation function is used in the output layer because it converts the model's raw output into probabilities that sum to one. This allows us to interpret the model's output as the probability of each car manufacturer.
We used the following hyperparameters (non-exhaustive list):
- epochs: 150 (Corresponds to the number of times the model sees the entire dataset during training.)
- batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
- validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)
The model is thus trained for 150 epochs with a batch size of 32; the validation split of 0.2 means that 20% of the training data is used for validation.
Overall, the model performs well, but we haven't dealt with the issue of unbalanced classes yet. Let's have a look at the distribution of the sensitivity and specificity for each class.
7.1.1.1.1 Issue of unbalanced classes
The issue of unbalanced classes, as explained previously, can strongly weaken the model's ability to generalize to new data: the model will automatically prefer to predict the most frequent classes. The boxplots below show the distribution of the sensitivity and specificity across the classes. Even though we already dealt in part with the unbalanced classes during the cleaning process, as seen in the graph (add graph reference), there are still big differences between the classes.
By examining the boxplot representing the distribution of sensitivity and specificity across the classes, we observe clear evidence of class imbalance. Sensitivity and specificity are not consistent across the classes. Specificity, which measures how well the model identifies true negatives, is very high for every class. This indicates that the model is effective at detecting instances that do not belong to a given class. However, sensitivity, which measures the true positive rate and reflects how well the model correctly predicts the make of a car for a specific brand, is not as high. This suggests that the model is not performing well for all classes.
For some classes with more vehicle models, the model tends to predict those classes more frequently, leading to higher accuracy but lower sensitivity for rarer classes. To address this issue, we will use class weights to ensure the model performs more evenly across all classes.
7.1.1.1.2 Adding class weights to the model
This technique has been explained in the methods section but in a nutshell, it is a way to penalize the model for misclassifying the minority class more than the majority class. This helps the model to learn the patterns in the minority class better and improve its performance on the test set.
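The weights themselves can be derived with scikit-learn's "balanced" heuristic, as in this sketch with a hypothetical two-make target:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 8 of one make, 2 of another
y = np.array(["Ford"] * 8 + ["Tesla"] * 2)
classes = np.unique(y)

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(range(len(classes)), weights))
```

The resulting dictionary (here, 0.625 for the majority make and 2.5 for the minority one) can be passed to Keras's `fit(..., class_weight=class_weight)`, so misclassifying a rare make costs the model four times as much.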
As we can see, the model now takes better care of the minority classes and overall sensitivity is higher across the classes. Sensitivity and specificity are also more consistent across the classes, and the model is better at generalizing to new data. In our case, this method does not completely eliminate the issue of unbalanced classes; given the structure of our data and the discrepancy between our classes, we will keep this technique for the following neural networks and move on.
7.1.1.1.3 Model performance
We can now look at the evolution of the accuracy of our model during the training process with the following plot.
As we can see, at each epoch, the accuracy is increasing and the loss is decreasing. The model is learning from the training data and improving its predictions.
But, in the end, we have a case of overfitting. The model performs well on the training data but not as well on the testing data. This is an issue because it limits the possibility of generalizing the model to new data.
Performance of the model with weighted class:
Final Training Accuracy: 41.03%
Final Validation Accuracy: 45.69%
Test Set Accuracy: 43.39%
Overall, the performance of the model is still good, but it can be improved. To address the issue of overfitting, we will introduce Dropout layers into the neural network.
7.1.1.2 Neural Network with Dropout layers
Dropout layers randomly set a fraction of input units to zero during training, which helps prevent overfitting by forcing the model to learn more robust features. We will tune the dropout rate to find the value that balances training and validation accuracy while reducing overfitting.
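Conceptually, (inverted) dropout can be sketched in NumPy as follows; this is a simplified illustration of the mechanism applied during training, not the Keras internals:

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    mask = rng.random(activations.shape) >= rate   # keep ~ (1 - rate) of units
    return activations * mask / (1.0 - rate)       # rescale to keep the mean

rng = np.random.default_rng(0)
a = np.ones((4, 8))                 # toy activations of a hidden layer
dropped = dropout(a, rate=0.25, rng=rng)
```

Because the surviving units are rescaled by 1/(1 - rate), the expected activation is unchanged, so dropout can simply be switched off at evaluation time; this also explains why validation accuracy can exceed training accuracy.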
We used the following hyperparameters (non-exhaustive list):
- epochs: 150 (Corresponds to the number of times the model sees the entire dataset during training.)
- batch_size: 32 (Corresponds to the number of samples that the model processes before updating the weights.)
- validation_split: 0.2 (Corresponds to the fraction of the training data to be used as validation data.)
- dropout_rate: varies (Corresponds to the fraction of neurons to drop during training.)
We will try 5 different dropout rates in addition to the no-dropout case. We will train the model with each dropout rate and evaluate its performance on the validation and test sets. We will then plot the training, validation, and test accuracies for each dropout rate to find the optimal value.
# Function to create and compile the model
def create_model(dropout_rate=0.0):
    model = Sequential([
        Input(shape=(X_train.shape[1],)),
        Dense(128, activation='relu'),
        Dropout(dropout_rate),
        Dense(64, activation='relu'),
        Dropout(dropout_rate),
        Dense(y_train.shape[1], activation='softmax')
    ])
    model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])
    return model
7.1.1.2.1 Selection of the best dropout rate
We can see that the model with a dropout rate of 0.1 offers the best balance between drastically reducing the overfitting problem and keeping a good overall accuracy. This model balances training and validation accuracy well and generalizes well to new data, largely eliminating the overfitting issue. We will use this dropout rate of 0.1 to train the final model with dropout layers.
7.1.1.2.2 Model performance
We see that the model with dropout layers performs better than the one without. We reached a better accuracy on the validation set and the model is clearly not overfitting as much. It is interesting to note that, as predicted, the validation accuracy is higher than the training accuracy; this is due to the way dropout layers work (dropout is active during training but not during evaluation) and is a good sign that the model is generalizing well to new data. The model does not need early stopping in our case, since the accuracies are not decreasing and the loss is not increasing.
Performance of the model with weighted class and dropout layers:
Final Training Accuracy: 34.86%
Final Validation Accuracy: 42.30%
Test Set Accuracy: 41.27%
The final accuracy of our model is not as high as with our first model, but this model is at least better at representing the data and generalizing to new data. We also computed Cohen's kappa score, which is a good indicator of the model's performance, and as we can see, the model performs well.
Cohen's Kappa Score: 40.68%
source(here::here("scripts","setup.R"))
library(data.table)
data_cleaned <- fread(here::here("data", "data_cleaned.csv"))
8 Principal Component Analysis
To see the links between the features, we can use a dimension-reduction technique such as Principal Component Analysis, which links the features according to their similarities across instances and combines them into fewer dimensions.
8.1 Data Standardization
data_prepared <- data_cleaned %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(across(where(is.factor), as.numeric)) %>%
scale() # Standardizes numeric data including converted factors
We begin by standardizing our data, meaning transforming our character variables into factors and then the factors into numeric values.
8.2 Heatmap
cor_matrix <- cor(data_prepared) # Calculate correlation matrix
# Melt Correlation Matrix
melted_cor_matrix <- melt(cor_matrix)
# Heatmap
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = sprintf("%.2f", value)), color = "black", size = 3.5) +
scale_fill_gradient2(low = "lightblue", high = "darkblue", mid = "blue", midpoint = 0, limit = c(-1,1),
name = "Spearman\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 45, hjust = 1),
plot.title = element_text(hjust = 0.5),
plot.title.position = "plot") +
labs(x = 'Variables', y = 'Variables',
title = 'Correlations Heatmap of Features')
We used this heatmap to check the correlations between the variables. As we can see, a few variables seem to be strongly correlated, but most of them are not, whether positively or negatively. Let's now look into the links between the features using a biplot, which combines the observations as well as the features.
8.3 Biplot
pca_results <- PCA(data_prepared, graph = FALSE)
summary(pca_results)
Call:
PCA(X = data_prepared, graph = FALSE)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
Variance 4.616 3.735 2.126 1.292 1.019 0.988 0.856
% of var. 25.644 20.748 11.809 7.177 5.660 5.490 4.753
Cumulative % of var. 25.644 46.392 58.201 65.378 71.038 76.527 81.280
Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
Variance 0.827 0.765 0.549 0.510 0.349 0.193 0.137
% of var. 4.594 4.250 3.047 2.834 1.941 1.071 0.759
Cumulative % of var. 85.875 90.124 93.172 96.005 97.946 99.017 99.777
Dim.15 Dim.16 Dim.17 Dim.18
Variance 0.027 0.008 0.003 0.002
% of var. 0.150 0.045 0.018 0.011
Cumulative % of var. 99.926 99.971 99.989 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr
1 | 3.334 | -1.087 0.001 0.106 | 0.065 0.000
2 | 3.387 | -0.907 0.000 0.072 | -0.079 0.000
3 | 3.522 | -1.153 0.001 0.107 | -0.031 0.000
4 | 2.761 | -1.068 0.001 0.150 | 0.012 0.000
5 | 2.713 | -0.991 0.001 0.133 | -0.022 0.000
6 | 2.713 | -0.991 0.001 0.133 | -0.022 0.000
7 | 2.847 | -1.039 0.001 0.133 | -0.066 0.000
8 | 2.876 | -1.151 0.001 0.160 | -0.019 0.000
9 | 4.850 | -1.810 0.002 0.139 | 0.346 0.000
10 | 3.313 | -1.686 0.001 0.259 | 0.231 0.000
cos2 Dim.3 ctr cos2
1 0.000 | 0.018 0.000 0.000 |
2 0.001 | 2.589 0.007 0.584 |
3 0.000 | 2.603 0.008 0.546 |
4 0.000 | 0.658 0.000 0.057 |
5 0.000 | 0.609 0.000 0.050 |
6 0.000 | 0.609 0.000 0.050 |
7 0.001 | 0.490 0.000 0.030 |
8 0.000 | 0.546 0.000 0.036 |
9 0.005 | -0.005 0.000 0.000 |
10 0.005 | 1.551 0.003 0.219 |
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr cos2
make | 0.106 0.242 0.011 | -0.092 0.229 0.009 |
model_year | 0.347 2.605 0.120 | 0.032 0.028 0.001 |
vehicle_class | -0.141 0.428 0.020 | 0.051 0.071 0.003 |
drive | -0.038 0.031 0.001 | 0.003 0.000 0.000 |
engine_cylinders | 0.077 0.130 0.006 | -0.073 0.141 0.005 |
engine_displacement | 0.125 0.338 0.016 | -0.104 0.292 0.011 |
transmission | -0.498 5.376 0.248 | 0.069 0.128 0.005 |
fuel_type_1 | -0.419 3.802 0.175 | 0.223 1.328 0.050 |
city_mpg_fuel_type_1 | 0.837 15.173 0.700 | -0.308 2.532 0.095 |
highway_mpg_fuel_type_1 | 0.800 13.860 0.640 | -0.313 2.630 0.098 |
Dim.3 ctr cos2
make -0.472 10.475 0.223 |
model_year -0.120 0.679 0.014 |
vehicle_class 0.474 10.568 0.225 |
drive 0.158 1.172 0.025 |
engine_cylinders 0.755 26.797 0.570 |
engine_displacement 0.851 34.040 0.724 |
transmission -0.018 0.016 0.000 |
fuel_type_1 -0.170 1.358 0.029 |
city_mpg_fuel_type_1 -0.242 2.753 0.059 |
highway_mpg_fuel_type_1 -0.351 5.790 0.123 |
# Biplot Graph
fviz_pca_biplot(pca_results,
geom.ind = "point",
geom.var = c("arrow", "text"),
col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE
)
The biplot shows several pieces of information. First, the two dimensions explain almost 50% of the total variance of the data. Each point represents an observation, and its color represents the quality of its representation: looking at the cos2 gradient, the redder the dot, the better the representation. The arrows (or vectors) represent the features: vectors pointing in similar directions indicate positive correlations, vectors pointing in opposite directions indicate negative correlations, and orthogonal vectors indicate uncorrelated variables.
Taking all of this into account, we can interpret the graph as follows: the mpg and range variables for a given fuel type (e.g. fuel type 1) point in the same direction and all seem to be positively correlated, while being uncorrelated with the same characteristics of the other fuel type (e.g. fuel type 2). Also, the mpg and range variables seem to be negatively correlated with their own fuel-type variable. Moreover, fuel type 1 is uncorrelated with fuel type 2, which makes sense. We can therefore say that Dimension 1 represents the "non-hybrid" characteristics of vehicles, whereas Dimension 2 is harder to interpret from this graph alone; it can potentially be linked to the "hybrid" characteristics of vehicles.
Also, looking at the biplot, there seem to be 3 distinct "clusters" in the observations. Thus, we will proceed with a cluster analysis to verify this.
8.4 Screeplot
# Use PCA results to generate the screeplot
fviz_eig(pca_results,
addlabels = TRUE,
ylim = c(0, 100),
barfill = "lightblue",
barcolor = "black",
main = "Scree Plot of PCA")
Taking the screeplot into account, 7 dimensions are needed to reach at least 80% of the variance, meaning the features are relatively independent. This was already visible in the biplot above: most arrows near the middle are short and have low cos2 values, meaning those features are better represented by dimensions other than the first two.
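As a quick check, the 80% threshold can be recomputed directly from the eigenvalues reported in the PCA summary above (a small sketch in Python, independent of the R pipeline):

```python
import numpy as np

# Eigenvalues (variances per dimension) from the PCA summary above
eig = np.array([4.616, 3.735, 2.126, 1.292, 1.019, 0.988, 0.856, 0.827,
                0.765, 0.549, 0.510, 0.349, 0.193, 0.137, 0.027, 0.008,
                0.003, 0.002])

# Cumulative percentage of variance; first dimension reaching 80%
cum_pct = 100 * np.cumsum(eig) / eig.sum()
n_dims = int(np.argmax(cum_pct >= 80) + 1)
print(n_dims, round(cum_pct[n_dims - 1], 1))   # → 7 81.3
```

This reproduces the 81.28% cumulative variance at Dim.7 shown in the summary table.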
9 Clustering
9.1 Cluster Selection
# Get PCA coordinates
pca_coords_df <- data.frame(pca_results$ind$coord)
set.seed(123) # for reproducibility
# Use Elbow Method
wcss <- vector()
for (i in 1:10) {
km <- kmeans(pca_coords_df[, 1:3], centers = i)
wcss[i] <- sum(km$withinss)
}
# Elbow Plot
elbow_data <- data.frame(Clusters = 1:10, WCSS = wcss)
elbow_plot <- ggplot(elbow_data, aes(x = Clusters, y = WCSS)) +
geom_point() +
geom_line() +
geom_text(aes(label = Clusters), vjust = -0.5) +
ggtitle("Elbow Method for Optimality") +
xlab("Number of Clusters") +
ylab("Within-Cluster Sum of Squares (WCSS)")
# Print the elbow plot
elbow_plot
# Looking at Elbow, Optimal nb of clusters = 3
optimal_clusters <- 3
km_result <- kmeans(pca_coords_df[, 1:3], centers = optimal_clusters) # KMeans
# Add cluster assignments to PCA coordinates
pca_coords_df$cluster <- km_result$cluster
# Cluster Plot
cluster_plot <- ggplot(pca_coords_df, aes(x = Dim.1, y = Dim.2, color = factor(cluster))) +
geom_point() +
ggtitle("Cluster") +
xlab(paste("Dim1 (", round(pca_results$eig[1, 2], 1), "%)", sep = "")) +
ylab(paste("Dim2 (", round(pca_results$eig[2, 2], 1), "%)", sep = "")) +
scale_color_manual(values = c("lightblue", "lightpink", "lightgreen"))
cluster_plot
The elbow method shows us the optimal number of clusters. In our case there seem to be two elbows; the first indicates an optimum at 3 clusters. We will work with 3 clusters and later see what the second elbow represents. We then look at the cluster plot, which shows the 3 clusters. We will go deeper in the interpretation by looking at the 3D version of the biplot, which combines the PCA and the clustering.
9.2 3D Biplot
# Calculate Cluster Centers
cluster_centers <- aggregate(pca_coords_df[, 1:3], by = list(cluster = pca_coords_df$cluster), FUN = mean)
# Get loadings for variables
loadings <- data.frame(
variable = rownames(pca_results$var$coord),
pca_results$var$coord[, 1:3]
)
# Create the 3D scatter plot with clusters and loadings
fig <- plot_ly() %>%
add_trace(
data = pca_coords_df,
x = ~Dim.1,
y = ~Dim.2,
z = ~Dim.3,
color = ~factor(km_result$cluster),
colors = c("lightblue", "lightpink", "lightgreen"),
type = 'scatter3d',
mode = 'markers',
marker = list(size = 3)
) %>%
add_trace(
data = cluster_centers,
x = ~Dim.1,
y = ~Dim.2,
z = ~Dim.3,
text = ~paste("Cluster", cluster),
type = 'scatter3d',
mode = 'text',
textposition = 'top center',
textfont = list(color = 'black', size = 10)
)
# Scale factor for loadings arrows
scale.loads <- 10.0
# Add loadings as arrows
for (k in 1:nrow(loadings)) {
fig <- fig %>%
add_trace(
x = c(0, loadings$Dim.1[k]) * scale.loads,
y = c(0, loadings$Dim.2[k]) * scale.loads,
z = c(0, loadings$Dim.3[k]) * scale.loads,
type = 'scatter3d',
mode = 'lines+markers',
line = list(width = 4, color = 'blue'),
marker = list(size = 2, color = 'blue'),
showlegend = FALSE
) %>%
add_trace(
x = loadings$Dim.1[k] * scale.loads,
y = loadings$Dim.2[k] * scale.loads,
z = loadings$Dim.3[k] * scale.loads,
text = loadings$variable[k],
type = 'scatter3d',
mode = 'text',
textposition = 'top center',
textfont = list(color = 'blue', size = 10),
showlegend = FALSE
)
}
# Layout
fig <- fig %>%
layout(
title = "PCA - 3D Biplot with Features",
scene = list(
xaxis = list(title = paste("Dim1 (", round(pca_results$eig[1, 2], 1), "%)", sep = "")),
yaxis = list(title = paste("Dim2 (", round(pca_results$eig[2, 2], 1), "%)", sep = "")),
zaxis = list(title = paste("Dim3 (", round(pca_results$eig[3, 2], 1), "%)", sep = ""))
)
)
# Display the plot
fig
Looking at this 3D biplot, we can clearly see the 3 different clusters. Cluster 1 seems to encompass the observations driven by the fuel-type-2 characteristics, i.e. the hybrid cars. Cluster 2 links observations through general car characteristics such as the vehicle class, the engine displacement and cylinders, the drive, the make, the fuel types and the model year. Finally, cluster 3 encompasses the characteristics of vehicles with fuel type 1, i.e. regular gas cars.
9.3 Silhouette Plot
# Set seed for reproducibility
set.seed(123)
# Sample a subset of the data if needed
sample_indices <- sample(1:nrow(pca_coords_df), 30000) # Adjust sample size as needed
sampled_data <- pca_coords_df[sample_indices, 1:3]
# Perform KMeans clustering
optimal_clusters <- 3 # Replace with your optimal number of clusters
kmeans_result <- kmeans(sampled_data, centers = optimal_clusters)
# Create a data frame with the clustering results
clustered_data <- data.frame(sampled_data, cluster = kmeans_result$cluster)
# Define the number of samples per cluster
samples_per_cluster <- min(table(clustered_data$cluster))
# Sample equal number of points from each cluster
equal_sampled_data <- do.call(rbind, lapply(1:optimal_clusters, function(cluster_num) {
  cluster_subset <- clustered_data[clustered_data$cluster == cluster_num, ]
  cluster_sample_indices <- sample(1:nrow(cluster_subset), samples_per_cluster)
  return(cluster_subset[cluster_sample_indices, ])
}))
# Calculate silhouette values for the sampled data
equal_sampled_silhouette_values <- silhouette(equal_sampled_data$cluster, dist(equal_sampled_data[, 1:3]))
# Define the colors for each cluster
colors <- c("lightblue", "lightpink", "lightgreen")
# Plot silhouette values for the sampled data with custom colors
fviz_silhouette(equal_sampled_silhouette_values, palette = colors) +
  ggtitle("Silhouette Plot for Sampled KMeans Clustering") +
  xlab("Cluster") +
  ylab("Silhouette Width")
  cluster size ave.sil.width
1       1  517          0.54
2       2  517          0.12
3       3  517          0.41
In order to produce a silhouette plot, we only used a sample of the dataset, as it would have been too heavy to run otherwise. Clusters 1 and 3 both show a high silhouette width, indicating that they are well clustered: their widths mostly lie above the average width (the red line), which means there is good consistency within each cluster. The 2nd cluster, on the other hand, shows more variation, indicating some heterogeneity. It contains some negative silhouette widths, and many of its observations fall below the red line, both of which suggest that some observations might fit better in another cluster. For simplification purposes, we decided to stay with 3 clusters; a 3D biplot with 6 clusters is provided in the annex (annex.qmd).
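The choice between cluster counts can also be checked numerically by comparing average silhouette widths, the same `ave.sil.width` criterion reported above. A minimal sketch, assuming only base R and the `cluster` package; the toy matrix `xy` is a hypothetical stand-in for the sampled PCA coordinates (`pca_coords_df`):

```r
library(cluster) # silhouette()

set.seed(123)
# Toy stand-in for the PCA coordinates: three well-separated groups
xy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 4), ncol = 2),
            matrix(rnorm(100, mean = 8), ncol = 2))

# Average silhouette width for k = 3 and k = 6
avg_sil <- sapply(c(3, 6), function(k) {
  km <- kmeans(xy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(xy))[, "sil_width"])
})
avg_sil
```

A higher average width means tighter, better-separated clusters; on clearly separated toy groups like these, k = 3 scores higher than k = 6, which mirrors the reasoning used to keep 3 clusters.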
To sum up the unsupervised learning part, some factors are clearly linked together, and the observations can be grouped into 3 clusters. These clusters being “Hybrid Vehicles”, “Vehicle Characteristics” and “Non-Hybrid Vehicles” makes sense: a car either has a single propulsion type or is a hybrid, and the vehicle characteristics form a cluster of their own because hybrid and non-hybrid vehicles can share some (though not all) of the same characteristics. As for the features, we need 7 dimensions to explain at least 80% of the total variance, meaning that not all the features are strongly linked together; the links are only moderately strong.
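The “7 dimensions for at least 80% of the variance” figure comes from the cumulative eigenvalue percentages (the `pca_results$eig` table in FactoMineR). A minimal sketch of the same computation with base R's `prcomp`; the matrix `X` is a hypothetical stand-in for the prepared numeric vehicle data:

```r
set.seed(42)
# Hypothetical stand-in for the prepared, numeric vehicle features
X <- matrix(rnorm(200 * 10), ncol = 10)

pca <- prcomp(X, scale. = TRUE)

# Cumulative share of total variance explained by the first d components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of dimensions reaching at least 80%
n_dims <- which(cum_var >= 0.80)[1]
```

On the real dataset this threshold is crossed at the 7th dimension; on uncorrelated toy data like `X` it takes nearly all components, which is exactly why a small `n_dims` signals correlated features.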
10 Summary of what was observed from the results
For the neural network model predicting the make (brand) of a vehicle from its technical characteristics, we managed to train a model with a high kappa score of almost 70%. We were working with unbalanced classes, and our first neural network was trained without addressing this problem, so it was very poor at predicting the minority classes. To remedy this, we applied class weights to the loss function so that the model would give more importance to the minority classes. We then had to address overfitting: the model's training accuracy was noticeably higher than its validation and test accuracy. To mitigate this, we applied a technique that we discovered but hadn't seen in class, dropout layers. This technique was interesting to work with. As expected, the validation accuracy ended up higher than the test and training accuracy; this is due to the way dropout layers are applied, with some neurons disabled only during the training phase and not during testing. We tuned this hyperparameter to reduce the gap between validation and test accuracy while making sure not to drop our accuracy too much. In the end, after 150 epochs, we obtained a model with a kappa score and a test-set accuracy of 70%. We can say that the model is able to predict the make of a vehicle from its technical characteristics with good accuracy.
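The class-weighting step described above boils down to giving each make a weight inversely proportional to its frequency. A minimal sketch in base R, with a hypothetical toy label vector standing in for the Make column:

```r
# Hypothetical, heavily unbalanced label vector (toy stand-in for Make)
make <- factor(c(rep("Ford", 8), rep("Bugatti", 2)))

# Inverse-frequency weights: n_samples / (n_classes * n_samples_in_class),
# so rare makes get large weights and common makes get small ones
freq <- table(make)
class_weights <- setNames(as.numeric(sum(freq) / (length(freq) * freq)),
                          names(freq))
class_weights # Bugatti: 2.5, Ford: 0.625
```

Weights computed this way can then be passed to the loss during training (e.g. via the `class_weight` argument of `fit()` in the keras R interface), so that errors on rare makes cost more.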
11 Implications for the Business
12 Limitations and What's Next
13 References
Azizi, I., & Boldi, M.-O. (2024). MLBA - S24 - Machine Learning in Business Analytics 2023-2024. MLBA - S24. Retrieved May 19, 2024, from https://do-unil.github.io/mlba/
Yadav, H. (2023, May 31). Dropout in Neural Networks - Towards Data Science. Medium. Retrieved May 19, 2024, from https://towardsdatascience.com/dropout-in-neural-networks-47a162d621d9
Klein, B. (2024, April 19). 21. Dropout Neural Networks in Python | Machine Learning. Retrieved May 19, 2024, from https://python-course.eu/machine-learning/dropout-neural-networks-in-python.php#:~:text=The%20dropout%20approach%20means%20that,learn%20set%20with%20this%20network.
14 Annex
14.1 Data Columns Detailed
| Name | data |
|---|---|
| Number of rows | 45896 |
| Number of columns | 26 |
| Column type frequency: | |
| character | 8 |
| numeric | 18 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Make | 0 | 1 | 3 | 34 | 0 | 141 | 0 |
| Model | 0 | 1 | 1 | 47 | 0 | 4762 | 0 |
| Fuel.Type.1 | 0 | 1 | 6 | 17 | 0 | 6 | 0 |
| Fuel.Type.2 | 0 | 1 | 0 | 11 | 44059 | 5 | 0 |
| Drive | 0 | 1 | 0 | 26 | 1186 | 8 | 0 |
| Engine.Description | 0 | 1 | 0 | 46 | 17031 | 590 | 0 |
| Transmission | 0 | 1 | 0 | 32 | 11 | 41 | 0 |
| Vehicle.Class | 0 | 1 | 4 | 34 | 0 | 34 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1.00 | 23102.11 | 13403.10 | 1.00 | 11474.75 | 23090.50 | 34751.25 | 46332.00 | ▇▇▇▇▇ |
| Model.Year | 0 | 1.00 | 2003.61 | 12.19 | 1984.00 | 1992.00 | 2005.00 | 2015.00 | 2023.00 | ▇▆▆▇▇ |
| Estimated.Annual.Petrolum.Consumption..Barrels. | 0 | 1.00 | 15.33 | 4.34 | 0.05 | 12.94 | 14.88 | 17.50 | 42.50 | ▁▇▃▁▁ |
| City.MPG..Fuel.Type.1. | 0 | 1.00 | 19.11 | 10.31 | 6.00 | 15.00 | 17.00 | 21.00 | 150.00 | ▇▁▁▁▁ |
| Highway.MPG..Fuel.Type.1. | 0 | 1.00 | 25.16 | 9.40 | 9.00 | 20.00 | 24.00 | 28.00 | 140.00 | ▇▁▁▁▁ |
| Combined.MPG..Fuel.Type.1. | 0 | 1.00 | 21.33 | 9.78 | 7.00 | 17.00 | 20.00 | 23.00 | 142.00 | ▇▁▁▁▁ |
| City.MPG..Fuel.Type.2. | 0 | 1.00 | 0.85 | 6.47 | 0.00 | 0.00 | 0.00 | 0.00 | 145.00 | ▇▁▁▁▁ |
| Highway.MPG..Fuel.Type.2. | 0 | 1.00 | 1.00 | 6.55 | 0.00 | 0.00 | 0.00 | 0.00 | 121.00 | ▇▁▁▁▁ |
| Combined.MPG..Fuel.Type.2. | 0 | 1.00 | 0.90 | 6.43 | 0.00 | 0.00 | 0.00 | 0.00 | 133.00 | ▇▁▁▁▁ |
| Engine.Cylinders | 487 | 0.99 | 5.71 | 1.77 | 2.00 | 4.00 | 6.00 | 6.00 | 16.00 | ▇▇▅▁▁ |
| Engine.Displacement | 485 | 0.99 | 3.28 | 1.36 | 0.00 | 2.20 | 3.00 | 4.20 | 8.40 | ▁▇▅▂▁ |
| Time.to.Charge.EV..hours.at.120v. | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Time.to.Charge.EV..hours.at.240v. | 0 | 1.00 | 0.11 | 1.01 | 0.00 | 0.00 | 0.00 | 0.00 | 15.30 | ▇▁▁▁▁ |
| Range..for.EV. | 0 | 1.00 | 2.36 | 24.97 | 0.00 | 0.00 | 0.00 | 0.00 | 520.00 | ▇▁▁▁▁ |
| City.Range..for.EV...Fuel.Type.1. | 0 | 1.00 | 1.62 | 20.89 | 0.00 | 0.00 | 0.00 | 0.00 | 520.80 | ▇▁▁▁▁ |
| City.Range..for.EV...Fuel.Type.2. | 0 | 1.00 | 0.17 | 2.73 | 0.00 | 0.00 | 0.00 | 0.00 | 135.28 | ▇▁▁▁▁ |
| Hwy.Range..for.EV...Fuel.Type.1. | 0 | 1.00 | 1.51 | 19.70 | 0.00 | 0.00 | 0.00 | 0.00 | 520.50 | ▇▁▁▁▁ |
| Hwy.Range..for.EV...Fuel.Type.2. | 0 | 1.00 | 0.16 | 2.46 | 0.00 | 0.00 | 0.00 | 0.00 | 114.76 | ▇▁▁▁▁ |
14.2 Data Summary
| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| ID | 45896 | 23102 | 13403 | 1 | 11475 | 34751 | 46332 |
| Model.Year | 45896 | 2004 | 12 | 1984 | 1992 | 2015 | 2023 |
| Estimated.Annual.Petrolum.Consumption..Barrels. | 45896 | 15 | 4.3 | 0.047 | 13 | 18 | 43 |
| Fuel.Type.1 | 45896 | | | | | | |
| ... Diesel | 1254 | 3% | | | | | |
| ... Electricity | 484 | 1% | | | | | |
| ... Midgrade Gasoline | 155 | 0% | | | | | |
| ... Natural Gas | 60 | 0% | | | | | |
| ... Premium Gasoline | 14138 | 31% | | | | | |
| ... Regular Gasoline | 29805 | 65% | | | | | |
| City.MPG..Fuel.Type.1. | 45896 | 19 | 10 | 6 | 15 | 21 | 150 |
| Highway.MPG..Fuel.Type.1. | 45896 | 25 | 9.4 | 9 | 20 | 28 | 140 |
| Combined.MPG..Fuel.Type.1. | 45896 | 21 | 9.8 | 7 | 17 | 23 | 142 |
| Fuel.Type.2 | 45896 | | | | | | |
| ... | 44059 | 96% | | | | | |
| ... E85 | 1513 | 3% | | | | | |
| ... Electricity | 296 | 1% | | | | | |
| ... Natural Gas | 20 | 0% | | | | | |
| ... Propane | 8 | 0% | | | | | |
| City.MPG..Fuel.Type.2. | 45896 | 0.85 | 6.5 | 0 | 0 | 0 | 145 |
| Highway.MPG..Fuel.Type.2. | 45896 | 1 | 6.6 | 0 | 0 | 0 | 121 |
| Combined.MPG..Fuel.Type.2. | 45896 | 0.9 | 6.4 | 0 | 0 | 0 | 133 |
| Engine.Cylinders | 45409 | 5.7 | 1.8 | 2 | 4 | 6 | 16 |
| Engine.Displacement | 45411 | 3.3 | 1.4 | 0 | 2.2 | 4.2 | 8.4 |
| Time.to.Charge.EV..hours.at.120v. | 45896 | 0 | 0 | 0 | 0 | 0 | 0 |
| Time.to.Charge.EV..hours.at.240v. | 45896 | 0.11 | 1 | 0 | 0 | 0 | 15 |
| Range..for.EV. | 45896 | 2.4 | 25 | 0 | 0 | 0 | 520 |
| City.Range..for.EV...Fuel.Type.1. | 45896 | 1.6 | 21 | 0 | 0 | 0 | 521 |
| City.Range..for.EV...Fuel.Type.2. | 45896 | 0.17 | 2.7 | 0 | 0 | 0 | 135 |
| Hwy.Range..for.EV...Fuel.Type.1. | 45896 | 1.5 | 20 | 0 | 0 | 0 | 520 |
| Hwy.Range..for.EV...Fuel.Type.2. | 45896 | 0.16 | 2.5 | 0 | 0 | 0 | 115 |
14.3 3D Biplot for 6 clusters
Warning in PCA(data_prepared, graph = FALSE): Missing values are imputed by the
mean of the variable: you should use the imputePCA function of the missMDA
package
After looking at the silhouette plot in the unsupervised learning part, we decided to provide a 3D biplot for 6 clusters, since the elbow plot also suggests that 6 could be optimal. In this biplot, we can observe that it is indeed possible to divide the observations into 6 clusters. Comparing it to the 3D biplot in the ‘results_unsupervised_learning’ part, we clearly notice that cluster 2 could be divided into four smaller clusters, which indicates heterogeneity in this cluster when using only 3 clusters. However, with 6 clusters in hand, those 4 distinct clusters are more difficult to interpret. This also explains the second elbow in the elbow method: at 3 clusters we reach a first optimum, and there is another steep drop between clusters 5 and 6, meaning that selecting 4 or 5 clusters would bring little benefit, while a 6th cluster could be worth capturing. Stopping at 3 clusters is still meaningful and makes our clustering analysis more interpretable than 6 clusters, which is why we selected only 3 clusters for our analysis.
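The two elbows discussed here come from the within-cluster sum of squares (WSS) curve. A minimal sketch of how such an elbow curve is computed, with a toy matrix `xy` as a hypothetical stand-in for the PCA coordinates:

```r
set.seed(123)
# Toy stand-in for the PCA coordinates: three separated groups
xy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
            matrix(rnorm(100, mean = 4), ncol = 2),
            matrix(rnorm(100, mean = 8), ncol = 2))

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(xy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where the curve stops dropping steeply
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

WSS always decreases as k grows; the useful signal is the k after which further drops become small, which is how both the first elbow (k = 3) and the second one (k = 6) were read off the plot.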